Abstract

Graph comparison is fundamentally important for many applications,
such as the analysis of social networks and biological data, and has been
a significant research area in the pattern recognition and pattern analysis
domains. Today's graphs are large and may have billions of nodes and
edges. Comparing such huge graphs is a challenging research problem.

In this paper, we survey research advances on comparison problems
in large graphs. We review graph comparison and pattern matching
approaches that focus on large graphs. We categorize the existing
approaches into three classes: partition-based approaches, search-space-based
approaches and summary-based approaches. The existing algorithms
in each class are described in detail and analyzed according
to multiple criteria such as time complexity, type of graphs and comparison
concept. Finally, we identify directions for future research.


1 Introduction 

Comparing objects is one of the most frequently encountered tasks in computing:
information retrieval, pattern recognition, biology, computer vision, etc.
A comparison problem occurs whenever an object or a piece of it needs to be
mapped to another object or part of it. Graphs are an attractive representation
and modeling tool since they allow simple, intuitive and flexible representations
of complex and interacting objects. Consequently, object comparison generally
leads to a problem of graph comparison. Although significant progress has been
made in graph comparison and related areas such as graph/subgraph isomorphism
and pattern matching, the recent explosion in the size of the data generated
and manipulated daily by applications and human activities has given rise to
the big graph data challenge. In fact, real-world graphs are large and even huge,
i.e., they have thousands, millions and even billions of nodes and edges. Social
networks, web graphs and protein interaction graphs are some examples. For these
graphs, existing solutions for graph analysis, mining, visualization, etc., do not
scale. These algorithms must be revisited or even re-invented.

Traditional graph comparison approaches are generally classified into two
categories: exact approaches and inexact approaches. Exact approaches, such
as graph isomorphism, subgraph isomorphism and the maximum common subgraph,
aim to find out whether an exact mapping between the vertices and the edges
of the compared graphs or subgraphs is possible humus].

Inexact graph comparison generally aims to compute a distance between
the compared graphs. This distance measures how similar these graphs are
and helps to deal with the errors and noise that are inevitably introduced
during the process of modeling objects by graphs. Inexact graph
comparison is also useful for search/rank based applications where a distance
between the compared objects is needed. In some applications, graph similarity
measures are intended to compute relatively suboptimal distances [18] that
are compensated by a large reduction of the computational complexity of the
comparison process. Several graph similarity measures have been proposed in
the literature and several approaches have been used including genetic
algorithms mm, neural networks m, the theory of probability urn clustering
techniques H3E5], spectral methods Eim, decision trees HEED], etc. We
refer the reader to [HUMUS] for more exhaustive surveys. In order to cope
with large graphs, new techniques, concepts and approaches have been proposed
recently for performing graph comparison. Thus, in this paper we focus mainly
on the solutions designed for large graphs.

The aim of this paper is to provide a survey of recent and current developments
in graph comparison and pattern matching approaches on large graphs.
We describe and analyze in detail the existing approaches and we categorize
them into different classes. We also highlight the advantages, disadvantages
and differences between the approaches and identify directions for future
research.

The rest of the paper is organized as follows: Section 2 presents the problem
definition and preliminaries. Section 3 presents the different approaches,
which we categorize, analyze and describe in detail in order to compare
them and to show their advantages and disadvantages. Section 4 summarizes
these approaches and proposes some important problems of graph comparison and
pattern matching that deserve further research. Section 5 concludes the paper.

2 Problem Definition and Basics 

In this section we present some basic definitions related to graphs and their 
comparison problems. We rely mainly on the terminology used in [3123. So, 
all the definitions below are adapted from [31123] . 

Definition 1 A graph G is a 4-tuple $G = (V, E, f_V, f_E)$, where V is a set of
nodes (also called vertices), $E \subseteq V \times V$ is a set of edges connecting the nodes,
$f_V : V \to \Sigma_V$ and $f_E : E \to \Sigma_E$ are functions labeling the nodes and the edges
respectively, where $\Sigma_V$ and $\Sigma_E$ are the sets of labels that can appear on the
nodes and edges, respectively.

When omitting $f_E$ in the definition of G, we mean that $\Sigma_E$ is an empty set
and the graph is not edge labeled. So, when there is no ambiguity, the notation
G = (V, E) denotes a vertex-labeled graph. We will also use the terms vertex and
node interchangeably throughout this document.

The edges of a graph may have a direction associated with them. In this 
case, the graph is directed. 

Generally, the number of vertices of a graph is called the order of the graph and 
the number of its edges is called the size of the graph. 

A graph that is contained in another graph is called a subgraph and is defined 
as follows: 

Definition 2 A graph $G_1 = (V_1, E_1, f_{V_1}, f_{E_1})$ is a subgraph of a graph
$G_2 = (V_2, E_2, f_{V_2}, f_{E_2})$, denoted $G_1 \subseteq G_2$, if $V_1 \subseteq V_2$,
$E_1 \subseteq E_2 \cap (V_1 \times V_1)$, $f_{V_1}(x) = f_{V_2}(x)$ $\forall x \in V_1$, and
$f_{E_1}((x, y)) = f_{E_2}((x, y))$ $\forall (x, y) \in E_1$.

The distance between two nodes u and v in a graph G, denoted by dist(u, v),
is the length of the shortest undirected path from u to v in G. The diameter of a
connected graph G, denoted by $d_G$, is the longest shortest distance over all pairs of
nodes in G, i.e., $d_G = \max(dist(u, v))$ for all nodes u, v in G. The eccentricity
of a vertex in a graph is its maximum distance from any other vertex in the
graph. The vertices of the graph with the minimum eccentricity are the centers
of the graph, and the value of their eccentricity is the radius of the graph. The
maximum value of eccentricity equals the diameter of the graph.
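
The notions above can be illustrated with a minimal sketch, assuming an unweighted, connected, undirected graph stored as an adjacency list. The function names `bfs_distances` and `eccentricities` and the example graph are hypothetical and are not taken from the surveyed works.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return the hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def eccentricities(adj):
    """Eccentricity of each node; assumes the graph is connected."""
    return {u: max(bfs_distances(adj, u).values()) for u in adj}


# Hypothetical example: the path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ecc = eccentricities(adj)
diameter = max(ecc.values())   # maximum eccentricity
radius = min(ecc.values())     # minimum eccentricity
centers = sorted(u for u, e in ecc.items() if e == radius)
```

Running one breadth-first search per node costs O(|V| (|V| + |E|)) overall, which is one reason exact diameters are expensive to obtain on very large graphs.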

Several applications that use graphs as a modeling tool, such as pattern
recognition, information retrieval and mining, need to compare graphs. Graph
comparison, also called graph matching, has been the subject of several studies
and surveys such as ng, ng, m and eg. Graph comparison approaches are
generally classified into two categories: exact approaches and inexact or
fault-tolerant approaches. Exact approaches refer to the methods used to find out
whether two graphs are the same [Tt)HT51fI51l73| . This means that we look for graph
isomorphism.

Fault-tolerant graph comparison generally aims to compute a distance between
the compared graphs. This distance measures how similar these graphs are
and is motivated mainly by three situations:

• The process of modeling objects by graphs may be subject to noise and
distortions. This means that a modeling process executed twice on the
same object may return two slightly different graphs. The different stages
of image encoding are perhaps the most illustrative example of such noise
that graph comparison must deal with m-

• Search/rank based applications, such as database query processing or web
search, need to compute a distance between the compared objects in order
to rank the top-k results [75][72].

• In some applications, graph similarity measures are intended to compute
relatively suboptimal distances [18] that are compensated by a large
reduction of the computational complexity of the comparison process.

In both approaches and depending on the application, we need either to compare 
two whole graphs or a query graph with a large graph. According to this, 
graph comparison methods can be classified into two categories: graph similarity 
measures and graph pattern matching methods. 

2.1 Graph similarity/dissimilarity measures 

The aim of similarity/dissimilarity measures is to quantify the degree of
resemblance between two graphs. The strongest similarity degree is graph "equality",
called graph isomorphism and defined as follows:

Definition 3 A graph $G_1 = (V_1, E_1, f_{V_1}, f_{E_1})$ and a graph $G_2 = (V_2, E_2, f_{V_2}, f_{E_2})$
are said to be isomorphic, denoted $G_1 \cong G_2$, if there exists a bijective function
$h : V_1 \to V_2$ such that the following conditions are met:

1. $\forall x \in V_1 : f_{V_1}(x) = f_{V_2}(h(x))$

2. $\forall (x, y) \in E_1 : (h(x), h(y)) \in E_2$ and $f_{E_1}((x, y)) = f_{E_2}((h(x), h(y)))$

3. $\forall (h(x), h(y)) \in E_2 : (x, y) \in E_1$ and $f_{E_2}((h(x), h(y))) = f_{E_1}((x, y))$
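
As an illustration of Definition 3, the following sketch tests isomorphism by brute force over all bijections. It is only meant to make the three conditions concrete on tiny graphs and is not one of the algorithms surveyed here. The representation (a dict of vertex labels and a dict of directed edge labels) and the name `are_isomorphic` are assumptions made for this example.

```python
from itertools import permutations


def are_isomorphic(g1, g2):
    """Brute-force check of Definition 3.

    Each graph is a pair (vertex_labels, edge_labels), where vertex_labels
    maps a vertex to its label and edge_labels maps a directed edge (u, v)
    to its label. Exponential in the number of vertices.
    """
    v1, e1 = g1
    v2, e2 = g2
    if len(v1) != len(v2) or len(e1) != len(e2):
        return False
    nodes1 = list(v1)
    for image in permutations(v2):
        h = dict(zip(nodes1, image))  # candidate bijection h: V1 -> V2
        # Condition 1: vertex labels are preserved.
        if any(v1[x] != v2[h[x]] for x in nodes1):
            continue
        # Condition 2: every edge of G1 maps to an edge of G2 with the same label.
        # Condition 3 then follows because h is a bijection and |E1| == |E2|.
        if all((h[x], h[y]) in e2 and e2[(h[x], h[y])] == lab
               for (x, y), lab in e1.items()):
            return True
    return False


# Hypothetical example graphs.
g1 = ({1: "a", 2: "b", 3: "c"}, {(1, 2): "m", (2, 3): "n"})
g2 = ({"x": "c", "y": "a", "z": "b"}, {("y", "z"): "m", ("z", "x"): "n"})
g3 = ({"x": "c", "y": "a", "z": "b"}, {("y", "z"): "n", ("z", "x"): "m"})
same = are_isomorphic(g1, g2)
different = are_isomorphic(g1, g3)
```

The factorial number of candidate bijections is what makes this formulation unusable beyond a handful of vertices.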

Several relaxed approaches, i.e., "fault-tolerant graph comparison", have also
been proposed. They are useful for search/rank based applications where a distance
between the compared objects is needed. In some applications, graph similarity
measures are intended to compute relatively suboptimal distances [18] that
are compensated by a large reduction of the computational complexity of the
comparison process.

Several graph similarity measures have been proposed in the literature and
several approaches have been used including genetic algorithms [41][69], neural
networks [51], the theory of probability am clustering techniques [12][66],
spectral methods [65][74], decision trees [49][50], etc. We refer the reader to
[Ql lTOlHSl[271176] for more exhaustive surveys. Some of the existing approaches
try to extend to graphs some of the properties defined in metric spaces.

Definition 4 A metric space is an ordered pair (M, d) where M is a set and d
is a metric on M, i.e., a function $d : M \times M \to \mathbb{R}$ such that for all x, y, z in M:

• $d(x, y) \geq 0$ (non-negativity),

• $d(x, y) = 0$ iff $x = y$ (uniqueness),

• $d(x, y) = d(y, x)$ (symmetry) and

• $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality).

Perhaps the most referenced metric is the edit distance, which defines the
similarity of graphs by the minimum-cost sequence of edit operations that converts
one graph into the other EOlEfl- An edit operation is either an insertion, a
deletion or a relabeling of a vertex or an edge in the graph. A cost function
associates a cost to each edit operation. Figure 1 shows an example of the edit
operations that are necessary to get the graph $G_2$ from $G_1$: the deletion
of two edges and a vertex and the relabeling of two vertices.

Figure 1: Example of edit operations [35].

Graph edit distance is a flexible graph similarity measure which is applicable
to various kinds of graphs mmmm- It also defines a common theoretical
framework that allows comparing different approaches of graph comparison. In
fact, Bunke showed in [6] that under a particular cost function, graph edit
distance computation is equivalent to the maximum common subgraph problem.
In [7], the same author shows that the graph isomorphism and subgraph
isomorphism problems can be reduced to graph edit distance. However, computing
graph edit distance suffers from two main drawbacks:

1. A high computational complexity. The problem of computing graph edit
distance is NP-hard in general [81]. The best-known method for computing
the exact value of graph edit distance is based on A* [35], which
is a best-first search algorithm where the search space is organized as a
tree. The root of the tree is the starting point of the algorithm. The internal
vertices correspond to partial solutions and the leaves represent complete
solutions.

2. The difficulty related to defining cost functions [58]. 

The first drawback motivated several approximate solutions to compute graph
edit distance. A comprehensive survey on graph edit distance and the approaches
proposed to compute it can be found in [30]. To overcome the second
drawback and avoid the definition of edit costs, similarity measures that do not
use edit operations have also been proposed. In mi, the authors propose a graph
distance measure that is based on the maximal common subgraph of two graphs
and prove that it is a metric, i.e., the measure satisfies the four properties of
a usual metric, namely non-negativity, uniqueness, symmetry and the triangle
inequality. However, computing the maximal common subgraph of two graphs has
a high computational complexity m- For this reason, Raymond et al. m propose
a modified version of the measure defined in m where an initial screening
process determines whether it is possible for the measure of similarity between
the two graphs to exceed a minimum threshold for which it is acceptable to
compute the maximum common subgraph. This screening process is based on
computing graph invariants. Graph invariants have been efficiently used to solve
the graph comparison problem in general and the graph isomorphism problem
in particular. They are used for example in Nauty [48], which is one of the most
efficient algorithms for graph and subgraph isomorphism testing. A vertex
invariant, for example, is a number i(v) assigned to a vertex v such that if there is
an isomorphism that maps v to v' then i(v) = i(v'). Examples of invariants are
the degree of a vertex, the number of cliques of size k that contain the vertex,
the number of vertices at a given distance from the vertex, etc. Graph invariants
are also the basis of graph probing [44], where a distance between two graphs
is defined as the norm of the difference of their probes. Each graph probe is a
vector of graph invariants.
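
To make the idea of probes concrete, here is a minimal sketch, assuming the probe of a graph is simply its degree histogram and the distance is the L1 norm of the difference between two probes. The exact probes used in [44] may differ; `degree_probe` and `probe_distance` are hypothetical names.

```python
from collections import Counter


def degree_probe(adj):
    """Probe of a graph: number of vertices having each degree value."""
    return Counter(len(neighbors) for neighbors in adj.values())


def probe_distance(adj1, adj2):
    """L1 norm of the difference between the two degree histograms."""
    p1, p2 = degree_probe(adj1), degree_probe(adj2)
    return sum(abs(p1[k] - p2[k]) for k in set(p1) | set(p2))


# Hypothetical example: a triangle versus a path on three vertices.
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
path = {1: [2], 2: [1, 3], 3: [2]}
d = probe_distance(triangle, path)
```

Because the degree is an invariant, two isomorphic graphs always have a probe distance of zero, while the converse does not hold. A probe distance is therefore useful as a cheap filter rather than as an exact test.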

In [77], the distance metric based on the maximum common subgraph defined
in mi is extended by a proposal to define the problem size with the union of
the two compared graphs rather than the larger of the two graphs used in la¬
In [50], the authors show that graph distance can be evaluated with a high
degree of precision by considering complex graph substructures in the distance.
In fact, in some applications such as the analysis of protein interaction graphs, some
substructures of these graphs represent certain functional modules of cells or
organisms. Hence, comparing these graphs in terms of substructure information
is biologically meaningful [80]. The authors defined a new metric based on the
concept of Structure Abundance Vector. Each element of the Structure Abundance
Vector of a graph G contains the size of an occurrence of a predefined
substructure in G. The Structure Abundance Vector is a generalization of the
concept of graph invariants.

More recently, kernel-based similarity measures have also been proposed Bnnrasg
The main idea is again to define the similarity of graphs based on the
similarity of substructures of these graphs.


2.2 Subgraph/Pattern matching 

Given two graphs Q and G, the graph pattern matching problem is to find all 
subgraphs of G that match Q. In other words, find all the embeddings of Q 
in G. Generally, Q is called the query graph or simply pattern and G is large 
compared to Q. The exact version of graph pattern matching is called Subgraph 
isomorphism and is defined as follows: 

Definition 5 A graph $Q = (V_Q, E_Q, f_{V_Q}, f_{E_Q})$ is subgraph isomorphic to a
graph $G = (V_G, E_G, f_{V_G}, f_{E_G})$ if there exists a subgraph G' of G such that Q
and G' are isomorphic.
Subgraph isomorphism is an NP-complete problem m- The best-known
methods to enumerate the subgraphs of G that are isomorphic to a query Q are
based on exploring search spaces. With these approaches, the number of possible
matchings to be checked increases combinatorially with the number of nodes in
the graphs. Even with the help of pruning methods that reduce the size of the
search space [19][73], these methods for subgraph isomorphism checking remain
impractical for large graphs such as social networks. Furthermore, these graphs
are directed, i.e., (u,v) and (v,u) denote different edges, and edge labeled.
Moreover, the considered graph patterns are not simple graphs. A pattern in
this kind of application is a "regular expression"-like graph where a node is
labeled by a search condition that specifies a set of possible values for the
node and the edge. In HE an edge in the query graph is a directed edge that does
not correspond to a direct edge between two nodes but to a reachability
condition: the endpoint of the edge must be reachable from the source
node of the edge. This idea was extended in [85] by the introduction of a bound
$\delta$ such that if there is an edge between two nodes in the query, these nodes are
mapped into the data graph to two nodes reachable within $\delta$ edges, i.e., the
shortest path between the two nodes has length at most $\delta$. More recently, [] introduces
"regular expression"-like graph patterns that combine the concept of bounded
edges of [55] with the power of regular expressions for defining the possible values
taken by the labels of the nodes.

Consequently, relaxed approaches that achieve a better time complexity and
that are better adapted to these pattern-based applications have been proposed. In this
context, graph simulation [37][52] is receiving increasing interest, especially for
social network analysis. Graph simulation is defined as follows:

Definition 6 A pattern $Q = (V_Q, E_Q, f_{V_Q}, f_{E_Q})$ matches a directed graph
$G = (V_G, E_G, f_{V_G}, f_{E_G})$ via simulation, denoted by $Q \prec G$, if there exists a binary
relation $S \subseteq V_Q \times V_G$ such that:

1. for each $u \in V_Q$, there exists $v \in V_G$ such that $(u, v) \in S$;

2. for each $(u, v) \in S$, we have

(a) $f_{V_Q}(u) = f_{V_G}(v)$;

(b) for each edge $(u, u') \in E_Q$ there is an edge $(v, v') \in E_G$ such that
$(u', v') \in S$.
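
Definition 6 can be turned into a simple fixpoint computation, sketched below under the assumption that edge labels are ignored. This naive refinement loop only illustrates the definition; it is not the quadratic-time algorithm of [37]. The function name `simulation` and the example data are hypothetical.

```python
def simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximal simulation as a dict u -> set of matching v,
    or None if the pattern does not match the data graph.
    """
    q_succ = {u: set() for u in q_labels}
    for u, u2 in q_edges:
        q_succ[u].add(u2)
    g_succ = {v: set() for v in g_labels}
    for v, v2 in g_edges:
        g_succ[v].add(v2)

    # Condition 2(a): start from all label-compatible pairs.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}

    # Condition 2(b): repeatedly remove pairs that violate the edge condition.
    changed = True
    while changed:
        changed = False
        for u in q_labels:
            for v in list(sim[u]):
                if any(not (g_succ[v] & sim[u2]) for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True

    # Condition 1: every pattern node needs at least one match.
    if any(not matches for matches in sim.values()):
        return None
    return sim


# Hypothetical pattern: one edge from an A-labeled node to a B-labeled node.
q_labels = {"q1": "A", "q2": "B"}
q_edges = [("q1", "q2")]
g_labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
g_edges = [("a1", "b1"), ("a2", "b2"), ("a3", "a1")]
result = simulation(q_labels, q_edges, g_labels, g_edges)
```

In this example the node a3 is discarded because it has no B-labeled successor, whereas b3 is kept even though it is isolated: the pattern node q2 has no outgoing edge, so condition 2(b) imposes nothing on its matches.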

The graph that corresponds to simulation S is called the match graph and 
is defined as follows: 

Definition 7 Let $Q = (V_Q, E_Q, f_{V_Q}, f_{E_Q})$ be a query graph that matches a data
graph $G = (V_G, E_G, f_{V_G}, f_{E_G})$ via simulation $S \subseteq V_Q \times V_G$. The match graph
that corresponds to S is a subgraph $G_S$ of G such that $G_S = (V_S, E_S)$, in which
(1) a node $v \in V_S$ iff $(u, v) \in S$ for some $u \in V_Q$, and (2) an edge $(v, v') \in E_S$ iff
there exists an edge $(u, u') \in E_Q$ with $(u, v) \in S$ and $(u', v') \in S$.
In contrast to isomorphism, when two graphs $G_1$ and $G_2$ match by simulation,
a node of one graph may be mapped to several nodes in the second graph. In
Figure 2, the query graph is isomorphic to subgraph $G_1$ but it matches by
simulation subgraphs $G_1$ and $G_2$. We note also that $G_2$ is not connected.



Figure 2: Subgraph isomorphism vs. graph simulation (panels: query, data graph).

Note that a quadratic time algorithm for graph simulation is proposed in [37]. 


3 Approaches 

In this section, we review graph comparison and pattern matching methods that
focus on large graphs. Existing approaches can be categorized into three classes:
partition-based approaches, search-space-based approaches and summary-based
approaches. Figure 3 summarizes the approaches that will be reviewed in the
rest of this section.

3.1 Partition-based Approaches 

The basic idea of these approaches is to decompose graphs into sets of subgraphs
and to compute the similarity between the initial graphs as a function of a
comparison between the obtained subgraphs. Partition-based approaches have
two advantages:

1. They have a polynomial time complexity and thus may be suitable for 
large graph comparison. 

2. They may highlight the existence of particular or meaningful structures 
within the compared graphs. These structures may enhance the accuracy 
of the comparison. 

Figure 3: Classification of graph comparison approaches

The first partition-based approach dates back to the 1980s with the work of
Eshera and Fu [21][22]. The authors compute the edit distance between two
attributed and directed graphs $G_1$ and $G_2$ in polynomial time
($O(n^2 m^2 (n + m))$ in the worst case, where n is the order of the graph and
m is its size). In this approach, the edit distance between $G_1$ and $G_2$ is mapped
to the edit distance between their basic subgraphs, called Basic Attributed
Relational Graphs (BARGs), defined as follows:

Definition 8 [21][22] A Basic Attributed Relational Graph (BARG or basic
graph) is a graph in the form of a one-level tree, i.e., it consists of a root node, the
branches emanating from it, and the nodes on which these branches terminate.

In other words, a BARG is a star structure composed of a root vertex, its
outgoing edges and the leaves associated with these edges. The mapping between
two sets of BARGs is achieved via the exploration of a state space organized as
a directed acyclic labeled lattice. Each state of the lattice is labeled with the
set of matched BARGs and denotes the reconstruction of a subgraph from the
query graph and a subgraph from the target graph as well as the matching of
their respective BARGs. An edge between two states is labeled by the cost of
the transition between the two states. The final distance between the two graphs
corresponds to the least-cost path in the lattice. It is determined by
dynamic programming.

In [68], the authors consider pairs of vertices and their connecting edges,
called Relational Descriptions (RDs). They define a distance between two graphs
based on the number of isomorphic RDs and prove that it is a metric. Given
two graphs $G_1$ and $G_2$, the distance is defined by the number of RDs of $G_1$ that
are not mapped to subgraphs of $G_2$ and the number of RDs of $G_2$ that are not
mapped to subgraphs of $G_1$.

Figure 4: A graph and its decomposition into BARGs.


In [62][63], the authors propose a modification of the approach of Eshera and
Fu [21][22] that considers undirected graphs and avoids the state exploration part
of the distance computation. In this solution, an optimal match between the
sets of star structures, called local structures, is obtained using the Hungarian
algorithm msa. Given a source graph $G_1$ and a target graph $G_2$, the nodes of
$G_1$ are mapped to the nodes of $G_2$ using the Hungarian algorithm by defining
a cost matrix that records, for each vertex of $G_1$, the edit operations that are
needed to transform it into each vertex of $G_2$.

A similar approach is presented in [81]. In this case, the graphs are also
undirected. They are decomposed into multisets of stars as in mm- In this
approach, a star structure is defined around each vertex as in [62][63] as follows:


Definition 9 [81] A star structure s is an attributed, single-level, rooted tree
which can be represented by a 3-tuple $s = (r, L, l)$, where r is the root vertex, L
is the set of leaves and l is a labeling function. Edges exist between r and any
vertex in L and no edge exists among vertices in L.

Figure 5 shows an example of a graph and its star decomposition.

The edit distance between two stars is defined as follows:

Definition 10 [81] Given two star structures $s_1$ and $s_2$, the edit distance
between $s_1$ and $s_2$ is:

$\lambda(s_1, s_2) = T(r_1, r_2) + d(L_1, L_2)$

where

$T(r_1, r_2) = 0$ if $l(r_1) = l(r_2)$, and $1$ otherwise,

$d(L_1, L_2) = ||L_1| - |L_2|| + M(L_1, L_2)$,

$M(L_1, L_2) = \max\{|\Psi_{L_1}|, |\Psi_{L_2}|\} - |\Psi_{L_1} \cap \Psi_{L_2}|$,

and $\Psi_L$ is the multiset of vertex labels in $L$.


The authors define the distance between two multisets of star structures.
Subsequently, they define the mapping distance between two graphs based on
the edit distance between their star representations using the Hungarian
algorithm [42][55].
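
The star edit distance of Definition 10 and the resulting assignment between two star decompositions can be sketched as follows. Two simplifying assumptions are made here: both graphs have the same number of vertices (so no padding with empty stars is needed), and the optimal assignment is found by brute force over permutations instead of the Hungarian algorithm, which is only viable for tiny examples. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def star_edit_distance(s1, s2):
    """Edit distance between two stars given as (root_label, leaf_labels)."""
    r1, leaves1 = s1
    r2, leaves2 = s2
    t = 0 if r1 == r2 else 1                      # root relabeling cost
    psi1, psi2 = Counter(leaves1), Counter(leaves2)
    common = sum((psi1 & psi2).values())          # multiset intersection size
    m = max(len(leaves1), len(leaves2)) - common  # leaf label mismatch
    return t + abs(len(leaves1) - len(leaves2)) + m


def stars_of(labels, adj):
    """One star per vertex: its label and the labels of its neighbors."""
    return [(labels[v], sorted(labels[w] for w in adj[v])) for v in adj]


def min_assignment_cost(stars1, stars2):
    """Minimum total star edit distance over all one-to-one assignments.

    Brute force stand-in for the Hungarian algorithm (assumption: equal sizes).
    """
    assert len(stars1) == len(stars2)
    return min(sum(star_edit_distance(a, b) for a, b in zip(stars1, perm))
               for perm in permutations(stars2))


# Hypothetical example: two labeled paths that differ in one vertex label.
adj = {1: [2], 2: [1, 3], 3: [2]}
g1_labels = {1: "a", 2: "b", 3: "c"}
g2_labels = {1: "a", 2: "b", 3: "d"}
single = star_edit_distance(("a", ["b", "c"]), ("a", ["b", "d", "d"]))
total = min_assignment_cost(stars_of(g1_labels, adj), stars_of(g2_labels, adj))
```

Relabeling one vertex changes both its own star and the star of its neighbor, which is why a single edit operation contributes a cost of 2 to the assignment in this example.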
Figure 5: A graph and its star decomposition according to [81]. The arrows
indicate the root of the stars.


In [79], the authors propose an index based on the star decomposition
provided in m- This index is made up of two parts: an index for all distinct
star structures from the given database, and an inverted list below each star
structure. The star structures are sorted in alphabetical order. Each entry in
the inverted lists contains the graph identity and the frequency of the
corresponding star structure. All lists are sorted in increasing order of the graph
size m■ However, enumerating all the different stars in a large graph database
may produce a huge index, which is not practical.

In [60], the authors also propose a polynomial time graph matching distance
based on subgraph matching using the Hungarian algorithm [42][53]. The
subgraphs are also stars but consider edge labels, which is not the case with [81]
and [62][63]. Each star structure is embedded within a vector of probes. Each
probe gives the number of times that a given label appears in the star. An
example is described in Figure 6.



Set of vertex and edge labels:   a  b  c  d  f  m  n  p  r
Probe vector of the star s3:     1  0  1  1  1  0  1  2  0

Figure 6: A graph and its decomposition into probe vectors. 
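
The probe vector of a star can be computed with a short sketch like the one below. The example graph is an assumption: it is inferred from the labels that survive in the figures of this section (vertices a, b, c, d, f labeled by their own names and edge labels m, n, p, r), and vertex and edge label alphabets are assumed to be disjoint. The function name `star_probe` is hypothetical.

```python
from collections import Counter

# Assumed label alphabet, in the order shown in Figure 6.
ALPHABET = ["a", "b", "c", "d", "f", "m", "n", "p", "r"]


def star_probe(labels, edges, root, alphabet):
    """Count how often each vertex or edge label appears in the star of root.

    edges maps an undirected edge (u, v) to its label.
    """
    counts = Counter([labels[root]])
    for (u, v), edge_label in edges.items():
        if root in (u, v):
            leaf = v if u == root else u
            counts[labels[leaf]] += 1
            counts[edge_label] += 1
    return [counts[x] for x in alphabet]


# Assumed example graph (inferred, not given explicitly in the text).
labels = {v: v for v in "abcdf"}
edges = {("a", "b"): "m", ("a", "c"): "n", ("c", "d"): "p",
         ("c", "f"): "p", ("d", "f"): "r"}
probe_c = star_probe(labels, edges, "c", ALPHABET)
```

Under these assumptions, the star rooted at c yields the vector 1 0 1 1 1 0 1 2 0, which agrees with the probe vector shown for the star s3.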


Note that the decomposition into stars in the approaches of [62], [81] and [60]
induces more overlaps than the decomposition into BARGs of [21][22], as the
number of BARGs is smaller than the number of stars in a graph.

A similar distance is also defined in [38], where a different representation
of the star structure is used. In this similarity measure, the star
structure is called node signature and is represented by a vector containing the
label of the root vertex, its degree, and the set of labels of its incident edges.
So, in this representation, the labels of the leaves of the star are not considered
in the subgraph, as illustrated in Figure 7. A distance between two node
signatures is also defined and the distance between two graphs is then defined as
an assignment problem in the matrix containing the distances between the node
signatures of the two compared graphs.



[Figure 7 panels: signatures of nodes a, b, c and f; e.g., node c: (c, 3, {n, p, p})
and node f: (f, 2, {p, r}).]

Figure 7: A graph and its decomposition into node signatures. 


In [78], the authors propose to decompose the compared graphs into
k-Adjacent Tree (k-AT) patterns (like the q-gram decomposition of strings [71]),
then use the number of their common k-AT patterns for edit distance estimation.
The adjacent tree of a vertex v (AT(v)) in a graph G is a breadth-first
search tree rooted at vertex v, where the children of each node of AT(v) are sorted by
their labels in the graph. The k-adjacent tree of a vertex v (k-AT(v)) in a graph
G is the top k-level subtree of AT(v) [78]. This means that the star structure
of [21][22] and the related methods is a 1-AT.
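
A k-adjacent tree can be built with a truncated breadth-first search, as in the following sketch. It assumes that "top k-level" means depth at most k below the root (so that a 1-AT is a star, as stated above) and that ties between equal labels may be broken arbitrarily. The nested-tuple representation and the name `k_adjacent_tree` are choices made for this example only.

```python
from collections import deque


def k_adjacent_tree(labels, adj, root, k):
    """Return k-AT(root) as a nested tuple (label, (child subtrees...))."""
    depth = {root: 0}
    children = {root: []}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if depth[u] == k:
            continue  # do not expand below level k
        # Visit neighbors in label order so children are sorted by label.
        for w in sorted(adj[u], key=lambda x: labels[x]):
            if w not in depth:
                depth[w] = depth[u] + 1
                children[w] = []
                children[u].append(w)
                queue.append(w)

    def build(u):
        return (labels[u], tuple(build(c) for c in children[u]))

    return build(root)


# Hypothetical example graph with vertices labeled by their own names.
labels = {v: v for v in "abcdf"}
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d", "f"],
       "d": ["c", "f"], "f": ["c", "d"]}
one_at_c = k_adjacent_tree(labels, adj, "c", 1)
two_at_a = k_adjacent_tree(labels, adj, "a", 2)
```

Since the result is a hashable tuple, the k-ATs of two graphs can be collected into multisets and intersected directly to count the patterns they share.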



Figure 8: A graph and its 2_ATs decomposition. 


The set of all k-ATs of a graph G is denoted k-ATs(G). An example is
illustrated in Figure 8. The number of common k-ATs, i.e.,
$|k\text{-}ATs(G_1) \cap k\text{-}ATs(G_2)|$, of two graphs is called the matching number of the two
graphs and is used to estimate their edit distance using the following inequality:

$|k\text{-}ATs(G_1) \cap k\text{-}ATs(G_2)| \geq |V(G_1)| - GED(G_1, G_2) \cdot 2(\Delta(G_1) - 1)^{k-1}$

where $GED(G_1, G_2)$ is the edit distance between $G_1$ and $G_2$, and $\Delta(G_1)$ and
$\Delta(G_2)$ are the maximum degrees of $G_1$ and $G_2$ respectively, with $\Delta(G_1) > 1$
and $\Delta(G_2) > 1$. This estimation has proven to be sufficiently tight but only for
sparse graphs m-

To avoid the above-cited drawback of tree-based q-grams, [83] proposes to use
a decomposition into path-based q-grams. A path-based q-gram in a graph G is
a path of length q with no repeated vertex. The edit distance can be estimated
with path-based q-grams using the following inequality: if $GED(G_1, G_2) \leq \tau$
then $G_1$ and $G_2$ share at least
$\max(|Q_{G_1}| - \tau \cdot D_{path}(G_1), |Q_{G_2}| - \tau \cdot D_{path}(G_2))$
path-based q-grams, where $|Q_G|$ is the size of the multiset of path-based q-grams
in G and $D_{path}(G)$ is the maximum number of path-based q-grams of $Q_G$ that can
be affected by an edit operation on G. $D_{path}(G)$ can be computed by:

$D_{path}(G) = \max_{u \in V_G} |Q_G^u|$

where $Q_G^u$ denotes the multiset of path q-grams that contain vertex u. To find
the pairs of graphs that are within an edit distance of $\tau$, the authors propose to
use either an inverted index that maps each path q-gram to a list of identifiers
of graphs that contain this path q-gram or a prefix filter such as those used in
string similarity measures Ca¬
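
The following sketch enumerates path-based q-grams and computes the quantity $D_{path}(G)$ by brute force. It assumes an undirected graph with comparable vertex identifiers and q ≥ 1, counts each undirected path once, and represents a q-gram by its sequence of vertex labels in a canonical orientation. These representation choices and the function names are assumptions for illustration and may differ from [83].

```python
from collections import Counter


def simple_paths(adj, q):
    """All simple paths with q edges; each undirected path is listed once."""
    paths = []

    def extend(path):
        if len(path) == q + 1:
            if path[0] < path[-1]:  # keep a single orientation per path
                paths.append(tuple(path))
            return
        for w in adj[path[-1]]:
            if w not in path:
                extend(path + [w])

    for v in adj:
        extend([v])
    return paths


def path_qgrams(labels, adj, q):
    """Multiset of path-based q-grams, as canonical label sequences."""
    grams = Counter()
    for p in simple_paths(adj, q):
        seq = tuple(labels[v] for v in p)
        grams[min(seq, seq[::-1])] += 1
    return grams


def d_path(adj, q):
    """Maximum number of q-grams that contain any single vertex."""
    paths = simple_paths(adj, q)
    return max(sum(1 for p in paths if u in p) for u in adj)


# Hypothetical example: the path 1 - 2 - 3 - 4 with alternating labels.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {1: "a", 2: "b", 3: "a", 4: "b"}
grams = path_qgrams(labels, adj, 2)
bound = d_path(adj, 2)
```

A pair of graphs can be discarded when the number of q-grams they share is below the bound stated above. The sketch also shows why high-degree vertices hurt: they appear in many paths, so $D_{path}(G)$ grows and the bound becomes loose.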
In [84], the authors point out that path-based q-grams still induce many
overlapping structures. If there are some high-degree vertices, the edit distance
estimated from path-based q-grams is not tight. They propose to use a new
q-gram-based structure, called branch structure, such that a single edit operation
affects at most two structures, allowing a tighter lower bound for edit distance
than existing q-gram structures. A branch structure b is a vertex v and the
multiset of labels of the edges incident to v. A branch is represented by
$b(v) = (l_v, ES)$, where $l_v = L_V(v)$ is the label of vertex v, and
$ES = \{L_E(e) \mid \text{edge } e \text{ is incident to } v\}$ is the multiset of edge labels
incident to v. A branch structure is equivalent to the node signature introduced
in [38]. Figure 9 shows an example of a graph and its branch structures.

Branch of node a: (a, {m, n})
Branch of node b: (b, {m})
Branch of node c: (c, {n, p, p})
Branch of node d: (d, {p, r})
Branch of node f: (f, {p, r})

Figure 9: A graph and its branch structures.

The authors define the edit distance between two branches as in [81] and derive the
distance between two multisets of branches $B(G_1)$ and $B(G_2)$ as the minimum
weighted matching in the bipartite graph whose vertices represent the branches
of $B(G_1)$ and $B(G_2)$ and whose edges represent transformations between any two
branches (from $B(G_1)$ and $B(G_2)$ respectively), weighted with their pairwise
branch edit distance. For solving the assignment problem, the authors use the
Hungarian algorithm [42]. The authors also prove that the obtained
branch-based distance is tighter than the star-based distance of [81].
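
Extracting branch structures is straightforward, as the sketch below shows. The edge list is an assumption: it is the graph inferred from the branch structures listed in Figure 9, with each vertex labeled by its own name. The function name `branches` is hypothetical.

```python
def branches(labels, edges):
    """Map each vertex to (its label, sorted labels of its incident edges).

    edges maps an undirected edge (u, v) to its label.
    """
    incident = {v: [] for v in labels}
    for (u, v), edge_label in edges.items():
        incident[u].append(edge_label)
        incident[v].append(edge_label)
    return {v: (labels[v], sorted(incident[v])) for v in labels}


# Assumed example graph, reconstructed from the branches shown in Figure 9.
labels = {v: v for v in "abcdf"}
edges = {("a", "b"): "m", ("a", "c"): "n", ("c", "d"): "p",
         ("c", "f"): "p", ("d", "f"): "r"}
b = branches(labels, edges)
```

Each edge contributes to exactly two branches (those of its endpoints), which matches the observation that a single edit operation affects at most two branch structures.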

To simplify the processing of a query graph in a large graph database, in [M]
the authors propose to use an R-tree based index where each leaf is the set of
branches of a graph of the database. An internal node of the tree is the union
of the branches of its children. The query graph is processed by traversing
the index starting from the root. For an internal node, the branch-based
distance is computed between the query graph and the set of branches of the
internal node. If this distance is greater than a given threshold, the subtree
rooted at this internal node can be safely pruned. However, computing the
distance with the whole branch set of an internal node does not scale to large
databases and may induce a significant overhead.

In [82], the authors propose to use variable-size non-overlapping partitions.
The proposed partitioning is based on the half-edge concept, defined as an edge
with only one end node and denoted by $(u, \cdot)$. Based on this concept, the authors
introduce the notion of half-edge graph, i.e., a graph that contains half-edges,
and half-edge subgraph isomorphism defined as follows:

Definition 11 A graph $Q = (V_Q, E_Q, f_{V_Q})$ is half-edge subgraph
isomorphic to a graph $G = (V_G, E_G, f_{V_G})$, denoted as $Q \sqsubseteq G$,
if there exists an injection $h : V_Q \to V_G$ such that
(1) $\forall u \in V_Q$, $h(u) \in V_G$ and $f_{V_Q}(u) = f_{V_G}(h(u))$;
(2) $\forall (u,v) \in E_Q$, $(h(u), h(v)) \in E_G$ and
$f_{V_Q}((u,v)) = f_{V_G}((h(u),h(v)))$; and
(3) $\forall (u, \cdot) \in E_Q$, $(h(u), w) \in E_G$ and
$f_{V_Q}((u,\cdot)) = f_{V_G}((h(u),w))$, with $w \in V_G \setminus h(V_Q)$.


Based on this, [82] develops a partition-based similarity search framework that
contains two phases: an indexing phase that can be performed offline and a
query processing phase performed for each query. The indexing phase takes as
input a graph database D and an edit distance threshold τ and constructs an
inverted index as follows:

• For each data graph G ∈ D, it first divides G into τ + 1 partitions. Figure
10 gives an example of a graph partition into 2 half-edge subgraphs.

• Then, for each partition, it inserts G's identifier into the corresponding
postings list of the partition.

Figure 10: A half-edge subgraph decomposition.

Since τ edit operations can affect at most τ of the τ + 1 disjoint partitions,
any graph within distance τ of the query must have at least one partition that
is contained in the query. The processing of a query q therefore starts by
probing the inverted index for candidate generation. For each partition p in
the inverted index, it tests whether p is contained in the query. If so, the
graphs in the postings list of p are filtered based on their size and their
labels. If the filtering produces a result within τ, the graph is kept as a
candidate for the query. Finally, candidates are verified with a classic graph
edit distance algorithm. The main problem of this approach is related to the
partitioning algorithm: the partitioning of a given graph is not unique.
Furthermore, the index is not practical for large graphs.

In [70], the authors propose a graph decomposition into STwigs. An STwig
is a two-level tree structure, q = (r, L), where r is the label of the root node
and L is the set of labels of its child nodes. Contrary to the star structures
discussed above (e.g., [60]), STwigs do not overlap (regarding edges): they are
edge-disjoint stars, as illustrated in Figure 11.

Figure 11: A graph and two of its possible decompositions into STwigs.

Clearly, such a decomposition is not unique, and different decompositions of
the same query incur different query processing costs. So, [70] proposes a query
decomposition that minimizes the number of obtained STwigs. The authors
prove that the minimum STwig cover problem is polynomially equivalent to
the minimum vertex cover problem. Consequently, they construct an STwig
cover from a vertex cover in a polynomial number of steps using an existing
2-approximation algorithm [20] for the vertex cover problem. Given a query
graph q, [70] first decomposes q into a set of STwigs, then uses graph
exploration to find matches for each STwig. Exploration at this step avoids
indexing on STwigs, which is not feasible for billion-node graphs. Finally, the
approach joins the results to find the final solution. The authors also modify
the 2-approximation algorithm for the STwig cover to produce an STwig order
that reduces the number of joins. In fact, it seems that, given a set of STwigs
produced by the decomposition step, an optimized order is one that ensures that
the root node of each STwig is a leaf node of at least one of the already
processed STwigs.
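The link between vertex covers and STwig covers can be illustrated with a short
Python sketch. It is not the algorithm of [70]: it only assumes the classic
maximal-matching 2-approximation for vertex cover and then turns each cover
vertex into the root of an edge-disjoint star. The function names are
hypothetical, and node identifiers are used in place of labels to keep the
example small.

```python
def vertex_cover_2approx(edges):
    """Classic 2-approximation: take both ends of every uncovered edge."""
    cover = []
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.extend([u, v])
    return cover


def stwig_decomposition(edges):
    """Split an undirected edge list into edge-disjoint stars (root, children).

    Every edge has at least one end in the vertex cover, so assigning each
    unused edge to the first cover vertex that touches it covers all edges
    exactly once.
    """
    used = set()
    stwigs = []
    for root in vertex_cover_2approx(edges):
        children = []
        for u, v in edges:
            edge = frozenset((u, v))
            if edge in used or root not in edge:
                continue
            used.add(edge)
            children.append(v if u == root else u)
        if children:
            stwigs.append((root, children))
    return stwigs


query_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
query_stwigs = stwig_decomposition(query_edges)
```

A different edge order yields a different cover and therefore a different
decomposition, which is the non-uniqueness noted above.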

In [39], the authors propose a tree q-gram like decomposition embedded in
a vector representation. In this approach, each partition rooted at node u
encompasses the h-hop neighbors of u, i.e., the set of nodes v whose distance
from u is less than or equal to h. The partition is encoded within a
multidimensional vector, called the neighborhood vector and denoted R(u) for
node u, with R(u) = {(l, A(u,l))}, where l is a label present in the
neighborhood of u and A(u, l) represents the strength of l in the neighborhood
of node u, obtained by:

$$A(u,l) = \sum_{i=1}^{h} \alpha^{i} \sum_{d(u,v)=i} I(l \in L(v)) \tag{1}$$

In this formula, L(v) is the label set of node v, and $I(l \in L(v))$ is an
indicator function which takes the value 1 when l is in the label set of v and
0 otherwise. d(u, v) is the distance between u and v. α is a constant called
the propagation factor that takes a value between 0 and 1. Figure 12 gives an
example of a graph and the neighborhood vectors associated with each of its
vertices with h = 2 and α = 0.5. The similarity between two neighborhood
vectors R(u) and R(v) of two nodes u and v respectively is computed by the
following cost function:


$$C(u,v) = \sum_{l \in R(u)} M(A(u,l), A(v,l)) \tag{2}$$

$$M(x,y) = \begin{cases} x - y & \text{if } x > y \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

Figure 12: A graph and its neighborhood vectors (h = 2 and α = 0.5; vertices
are numbered to distinguish them). The vectors shown are
R(1) = {<b,0.75>, <c,0.5>, <f,0.25>}, R(2) = {<a,0.5>},
R(3) = {<b,0.75>, <a,0.5>, <f,0.25>}, R(4) = {<a,0.25>, <c,0.5>, <f,0.5>} and
R(5) = {<b,0.5>, <c,0.5>, <a,0.25>}.
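To make the neighborhood vector and its cost function concrete, the following
Python sketch computes the strength A(u, l) with a breadth-first search limited
to h hops, and then the cost C(u, v) as a sum of positive differences. The
function names and the three-node path graph are illustrative assumptions; the
graph of Figure 12 is not reproduced here.

```python
from collections import defaultdict, deque


def neighborhood_vector(adj, labels, u, h=2, alpha=0.5):
    """Return {label: A(u, label)} for the nodes within h hops of u.

    adj: dict node -> iterable of neighbours (undirected graph)
    labels: dict node -> set of labels
    """
    dist = {u: 0}
    queue = deque([u])
    vector = defaultdict(float)
    while queue:
        x = queue.popleft()
        if dist[x] == h:
            continue  # do not expand beyond h hops
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
                # A node at distance i adds alpha**i for each of its labels.
                for label in labels[y]:
                    vector[label] += alpha ** dist[y]
    return dict(vector)


def positive_difference(x, y):
    """M(x, y): x - y when x > y, and 0 otherwise."""
    return x - y if x > y else 0.0


def cost(vec_u, vec_v):
    """C(u, v): sum over the labels of R(u) of M(A(u, l), A(v, l))."""
    return sum(positive_difference(a, vec_v.get(l, 0.0)) for l, a in vec_u.items())


# Hypothetical path graph 1 - 2 - 3 with one label per node.
adj = {1: [2], 2: [1, 3], 3: [2]}
labels = {1: {"a"}, 2: {"b"}, 3: {"c"}}
r1 = neighborhood_vector(adj, labels, 1)
r2 = neighborhood_vector(adj, labels, 2)
```

Note that the cost is not symmetric: it only penalizes labels that are stronger
around the first node than around the second.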

Using neighborhood vectors, the authors propose an algorithm that finds all
the embeddings of a query graph Q in a target graph G as follows:

1. compute the neighborhood vectors $R_G(u)$ and $R_Q(v)$ for all nodes
$u \in V(G)$, $v \in V(Q)$,

2. for each node pair $u \in V(G)$, $v \in V(Q)$ s.t. $L(v) \subseteq L(u)$,
calculate the node matching cost cost(u, v) as the difference of their
neighborhood vectors, $cost(u,v) = \sum_{l \in R(v)} M(A_Q(v,l), A_G(u,l))$.
This yields, for each $v \in V(Q)$, a list List(v) of possible matching nodes
such that $List(v) = \{u \in V(G), cost(u,v) < \epsilon\}$, where ε is a
similarity threshold. To speed up the computation of List(v) for all
$v \in V(Q)$, two kinds of indexes can be constructed offline for G:

• a label-based index with a hash table corresponding to each label of
G. This index is efficient if the labels are selective.

• a structure-based index, which is built on the neighborhood vectors.

3. use dynamic programming to find the embeddings of Q in G from the final
list of matched nodes for each node $v \in V(Q)$.

In [40], the authors extend the approach proposed in [39] with an inference
algorithm that iteratively boosts the score of more promising candidate nodes,
considering both label and structural similarity. This approach, called NeMa,
is based on the neighborhood vector introduced in [39], with the slight
difference that the neighborhood vector in NeMa gives more importance to the
distance than to the labels. The authors motivate this by two remarks from real
applications: (a) if two nodes are close in a query graph, the corresponding
nodes in the result graph must also be close; however, (b) there may be some
differences in the labels of the matched nodes due to noise and heterogeneity
in the data.

The neighborhood vector of a node u in a graph G is given by
$R_G(u) = \{\langle u', P_G(u,u') \rangle\}$, where u' is a node within h hops
of u, and $P_G(u,u')$ denotes the proximity of u' from u in G:

$$P_G(u,u') = \begin{cases} \alpha^{d(u,u')} & \text{if } d(u,u') \le h \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where d(u, u') is the distance between u and u'. The propagation factor α is a
parameter between 0 and 1, and h > 0 is the hop number delimiting the
neighborhood. Given a matching function φ, the matching cost of the
neighborhood vectors of two nodes v and u = φ(v) is given by:

$$N_\phi(v,u) = \frac{\sum_{v' \in N(v)} M(P_Q(v,v'), P_G(u,\phi(v')))}{\sum_{v' \in N(v)} P_Q(v,v')} \tag{5}$$

where N(v) is the set of nodes in the neighborhood of v and M is defined by
Equation (3).

The global cost C(φ) of the matching function φ between the query graph
Q and the target graph G is given by:

$$C(\phi) = \sum_{v \in V_Q} F_\phi(v, \phi(v)) \tag{6}$$

where $F_\phi(v, \phi(v))$ is the individual node matching cost between v and
u = φ(v), defined as a linear combination of the label difference function
$\Delta_L$ and the neighborhood matching cost function via a parameter
0 < λ < 1, whose optimal value is set empirically:

$$F_\phi(v, \phi(v)) = \lambda \, \Delta_L(L_Q(v), L_G(\phi(v))) + (1 - \lambda) \, N_\phi(v, \phi(v)) \tag{7}$$

The label difference function between two node labels is based on the
Jaccard similarity.
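The node matching cost can be sketched in a few lines of Python. The sketch
makes two assumptions that the text does not spell out: the label difference is
taken to be one minus the Jaccard similarity of the two label sets, and the
proximities of the neighbors are passed in as precomputed dictionaries. All
function and parameter names are hypothetical.

```python
def jaccard_difference(labels_a, labels_b):
    """Assumed label difference: 1 - Jaccard similarity of two label sets."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def neighborhood_cost(prox_q, prox_g, phi):
    """Normalized neighborhood matching cost of v and u = phi(v).

    prox_q: dict v' -> P_Q(v, v') for the query neighbours v' of v
    prox_g: dict u' -> P_G(u, u') for the target neighbours u' of u
    phi: dict mapping every query node to its matched target node
    """
    total = sum(prox_q.values())
    if total == 0:
        return 0.0
    # Positive difference M(x, y) applied to each mapped neighbour.
    missing = sum(max(p - prox_g.get(phi[v2], 0.0), 0.0) for v2, p in prox_q.items())
    return missing / total


def node_cost(labels_v, labels_u, prox_q, prox_g, phi, lam=0.5):
    """Linear combination of label difference and neighborhood cost."""
    return (lam * jaccard_difference(labels_v, labels_u)
            + (1 - lam) * neighborhood_cost(prox_q, prox_g, phi))
```

With λ close to 1 the cost is dominated by label mismatches; with λ close to 0
it is dominated by how well the neighborhood of v is preserved around φ(v).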

To find a matching function φ that minimizes C(φ), [40] uses a heuristic
based on the max-sum inference problem in graphical models [59].


3.2 State Space Exploring Approaches

In this class of methods, we find mainly graph pattern matching approaches
where a number of candidate vertices, subgraphs or regions are explored in a
large data graph to find the different embeddings of some query graph or the
subgraphs that match a given graph pattern.

In [34], the authors propose a solution, called TURBOiso, to robustly compute
subgraph isomorphism with two mechanisms: a tree rewriting of the query
graph and candidate region exploration. A candidate region for a query graph
Q is a subgraph of the data graph G which may contain embeddings of the
query graph. So, performing subgraph isomorphism search on all candidate
regions ensures that all embeddings are obtained. However, minimizing the
number of candidate regions and the size of each region is important
for faster matching. In order to minimize the size of each candidate region, the
authors propose to:

1. rewrite the query Q into an equivalent NEC (Neighborhood Equivalence
Class) tree Q'. In Q', each set of vertices that have the same label and the
same set of adjacent query vertices is merged into one NEC vertex. So, a
NEC vertex is a compressed form of a set of vertices. Consequently, using
Q' instead of Q accelerates the candidate region exploration process,
since the number of vertices is smaller.

2. construct candidate regions for the query Q in the data graph G by
constructing for each region a BFS search tree $T_Q$ from the root node $u'_s$
of the NEC tree Q' so that each leaf is on the shortest path from $u'_s$. Then,
for the start vertex $v_s$ of each target candidate region, identify candidate
data vertices for each query vertex by simply performing depth-first search
using $T_Q$ and starting from $v_s$.
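The merging rule of step 1 can be sketched as a simple grouping of query
vertices by the pair (label, set of adjacent query vertices). This is only an
illustration of the rule as stated above, not the full NEC tree construction of
[34]; the function name `nec_classes` and the small star-shaped query are
assumptions.

```python
from collections import defaultdict


def nec_classes(adj, label):
    """Group query vertices that share a label and a neighbour set.

    adj: dict vertex -> set of adjacent query vertices
    label: dict vertex -> label
    Returns a sorted list of sorted vertex groups, one per NEC vertex.
    """
    groups = defaultdict(list)
    for u in adj:
        groups[(label[u], frozenset(adj[u]))].append(u)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical query: u1 is adjacent to two B-labeled leaves and one C leaf.
query_adj = {"u1": {"u2", "u3", "u4"}, "u2": {"u1"}, "u3": {"u1"}, "u4": {"u1"}}
query_label = {"u1": "A", "u2": "B", "u3": "B", "u4": "C"}
classes = nec_classes(query_adj, query_label)
```

Here u2 and u3 collapse into one NEC vertex, so the rewritten query has three
vertices instead of four.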

Minimizing the number of regions comes through a careful choice of the root of
the NEC tree. For this, TURBOiso ranks every query vertex u by
$Rank(u) = \frac{freq(G, L(u))}{deg(u)}$, where freq(G, l) is the number of
data vertices in G that have label l, and deg(u) is the degree of u. This
ranking function favours lower frequencies and higher degrees, which minimizes
the number of regions.
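As a small illustration of this ranking, the Python sketch below counts label
frequencies in the data graph and returns the query vertex with the smallest
rank. It assumes that the vertex with the lowest rank value is the one chosen
as root, which is how the text describes the preference, and that every query
vertex has at least one neighbor. The function name is hypothetical.

```python
from collections import Counter


def choose_start_vertex(query_adj, query_label, data_labels):
    """Return the query vertex minimizing freq(G, L(u)) / deg(u).

    query_adj: dict query vertex -> list of adjacent query vertices
    query_label: dict query vertex -> label
    data_labels: dict data vertex -> label
    """
    freq = Counter(data_labels.values())

    def rank(u):
        # Rare labels and high degrees both push the rank down.
        return freq[query_label[u]] / len(query_adj[u])

    return min(query_adj, key=rank)


query_adj = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}
query_label = {"u1": "A", "u2": "B", "u3": "B"}
data_labels = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "B"}
start = choose_start_vertex(query_adj, query_label, data_labels)
```

In this example u1 has rank 2/2 = 1 while u2 and u3 have rank 4/1 = 4, so u1 is
selected.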

When exploring candidate regions, TURBOiso also minimizes the number
of enumerated partial solutions by ordering the NEC vertices by increasing
sizes. Thus, paths involving fewer vertices are explored first, and the space
is pruned rapidly if no isomorphism is possible.

[23] introduces bounded simulation, an extension of graph simulation intended
to deal with graph queries expressed with graph patterns. In this case,
all graphs are directed and a pattern graph is defined as follows:

Definition 12 [23] A pattern graph is defined as
$P = (V_P, E_P, f_{V_P}, f_{E_P})$, where

1. $V_P$ and $E_P$ are the set of nodes and the set of directed edges,
respectively, as defined for data graphs;

2. $f_{V_P}$ is a function defined on $V_P$ such that for each node u,
$f_{V_P}(u)$ is the predicate of u, defined as a conjunction of atomic formulas
of the form A op a; here A denotes an attribute, a is a constant, and op is a
comparison operator <, ≤, >, ≥, =, ≠;

3. $f_{E_P}$ is a function defined on $E_P$ such that for each edge
$(u,u') \in E_P$, $f_{E_P}((u,u'))$ is either a positive integer k or a
symbol *.

Intuitively, the predicate $f_{V_P}(u)$ of a node u specifies a search
condition and may admit several possible label values. The integer
$f_{E_P}((u,u'))$ of an edge (u, u') means that the edge (u, u') can be matched
to a path of length at most $f_{E_P}((u,u'))$. A simple graph query corresponds
to a graph pattern where $f_{V_P}(u)$ is simply the label of u and
$f_{E_P}((u,u')) = 1$. In bounded simulation, the term "bounded" relates to the
bound carried by each edge in the pattern. This bound is the maximum length of
a path in the data graph that matches the edge of the pattern. Bounded
simulation is defined as follows:

Definition 13 [23] A data graph $G = (V, E, f_A)$ matches the pattern query
$Q = (V_Q, E_Q, f_{V_Q}, f_{E_Q})$ via bounded simulation, denoted by
$Q \prec G$, if there exists a binary relation $S \subseteq V_Q \times V$ such
that:

• for each $u \in V_Q$, there exists $v \in V$ such that $(u,v) \in S$;

• for each $(u,v) \in S$, (a) the attributes $f_A(v)$ of v satisfy the
predicate $f_{V_Q}(u)$ of u; and (b) for each edge (u, u') in $E_Q$, there
exists a non-empty path p = v/.../v' in G such that $(u',v') \in S$, and
$len(p) \le k$ if $f_{E_Q}((u, u'))$ is a constant k.
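Definition 13 suggests a simple fixpoint computation, sketched below in Python.
This is a naive illustration under stated assumptions, not the algorithm of
[23]: shortest non-empty path lengths are precomputed with one breadth-first
search per data node, predicates are passed as Python callables, and the symbol
* is encoded as `None`. All names are hypothetical.

```python
from collections import deque


def nonempty_path_lengths(adj, source):
    """Shortest length of a non-empty directed path from source to each node."""
    dist = {}
    queue = deque((nbr, 1) for nbr in adj.get(source, ()))
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue
        dist[node] = d
        queue.extend((nbr, d + 1) for nbr in adj.get(node, ()))
    return dist


def bounded_simulation(pattern_edges, predicates, adj, attrs):
    """Return {pattern node: set of matching data nodes}, or None if no match.

    pattern_edges: list of (u, u2, bound); bound is an int k or None for '*'
    predicates: dict pattern node -> callable on the attributes of a data node
    adj: dict data node -> list of successors; attrs: dict data node -> attributes
    """
    dist = {v: nonempty_path_lengths(adj, v) for v in attrs}
    sim = {u: {v for v in attrs if pred(attrs[v])} for u, pred in predicates.items()}
    changed = True
    while changed:
        changed = False
        for u, u2, bound in pattern_edges:
            for v in list(sim[u]):
                # v survives only if some match of u2 is reachable within the bound.
                ok = any(v2 in sim[u2] and (bound is None or d <= bound)
                         for v2, d in dist[v].items())
                if not ok:
                    sim[u].discard(v)
                    changed = True
    return sim if all(sim.values()) else None
```

For a pattern edge a → b with bound 2, a data node labeled A whose nearest
B-labeled descendant is three hops away is discarded, whereas it is kept when
the bound is *.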

In this paper, the authors also introduce the concept of maximum match
graph to represent the union of all matches of a query in a data graph. This
means that bounded simulation searches for a unique result graph that
encompasses all the subgraphs that match the query pattern, as illustrated in
Figure 13.

Then, they propose an algorithm for incremental matching that avoids the
cost of re-computing the result graph when the data graph is modified.
This helps the approach scale to large graphs.

In [45], the authors focus on reducing the number of matches returned
by graph simulation and bounded simulation, the extension of graph simulation
proposed in [23]. This is achieved by enforcing two conditions:

1. Duality, which corrects the behavior of graph simulation concerning
topology preservation of the query. In fact, as shown in Figure 2, graph
simulation may return a disconnected subgraph for a connected graph query,
which increases the number of matches. To avoid this, [45] proposes dual
simulation, defined as follows:

Definition 14 [45] A data graph $G = (V, E, f_A)$ matches the pattern
$Q = (V_Q, E_Q, f_v, f_e)$ via dual simulation, denoted by
$Q \prec^{D}_{sim} G$, if $Q \prec G$ with a binary match relation
$S_D \subseteq V_Q \times V$, and for each pair $(u,v) \in S_D$ and each edge
$(u_2, u) \in E_Q$, there exists an edge $(v_2, v) \in E$ with
$(u_2, v_2) \in S_D$.

Thus, dual simulation requires matched nodes to preserve both their outgoing
and their incoming edges, and thereby avoids simulating a connected graph with
a disconnected one. Accordingly, in the example of Figure 2, only subgraph
$G_1$ is returned as the result graph match.

2. Locality, which reduces the diameter of the subgraph returned by bounded
simulation. In fact, bounded simulation returns a maximum match that
encompasses all the matches of the query. This maximum match is unique
but may be too large a graph. Locality is enforced by requiring matches
to be within a ball of radius equal to the diameter of the query. A ball is
defined as follows:

Definition 15 For a node v in a graph G and a non-negative integer r,
the ball with center v and radius r is a subgraph of G, denoted by G[v, r],
such that (1) for all nodes $v' \in G[v, r]$, the shortest distance
$dist(v, v') \le r$, and (2) it has exactly the edges that appear in G over the
same node set.

Definition 16 A data graph $G = (V, E, f_A)$ matches the query pattern
$Q = (V_Q, E_Q, f_v, f_e)$ via strong simulation if there exist a vertex
$v \in V$ and a connected subgraph $G_s$ of G such that:

• $Q \prec^{D}_{sim} G_s$ with the maximum match relation S;

• $G_s$ is exactly the match graph of Q with S, and

• $G_s$ is contained in the ball $G[v, d_Q]$ of center v and radius $d_Q$, the
diameter of Q.

Figure 14: Bounded simulation vs. dual and strong simulation (panel label:
Data graph).


Figure 14 illustrates an example adapted from [45]. In this example, with
simulation and bounded simulation the query graph matches the whole data graph.
However, dual simulation returns $G_3$ while strong simulation returns $G_1$.
Note that the diameter of the query graph is 2. Subgraph isomorphism returns
$G_1$.

The authors show that strong simulation has the same complexity as simulation
and bounded simulation while preserving graph topology. They propose
a cubic-time algorithm that returns the set of subgraphs of a data graph that
match a graph query by strong simulation. The algorithm inspects the balls of
radius equal to the query diameter and centred at each node of the data graph.
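The two ingredients of strong simulation, duality and locality, can be
sketched in Python as follows. The sketch is a naive fixpoint and not the
cubic-time algorithm of [45]; it assumes that nodes carry a single label, that
pattern edges have no bounds, and that the ball is computed with undirected
distances. The function names are hypothetical.

```python
def dual_simulation(q_edges, q_label, g_edges, g_label):
    """Return {query node: set of data nodes} under dual simulation, or None."""
    g_edge_set = set(g_edges)
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]} for u in q_label}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Child condition: a match of u needs a successor matching u2.
            for v in list(sim[u]):
                if not any((v, v2) in g_edge_set for v2 in sim[u2]):
                    sim[u].discard(v)
                    changed = True
            # Parent condition: a match of u2 needs a predecessor matching u.
            for v2 in list(sim[u2]):
                if not any((v, v2) in g_edge_set for v in sim[u]):
                    sim[u2].discard(v2)
                    changed = True
    return sim if all(sim.values()) else None


def ball(g_edges, center, radius):
    """Nodes within `radius` undirected hops of center, with the induced edges."""
    nbrs = {}
    for a, b in g_edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    nodes, frontier = {center}, {center}
    for _ in range(radius):
        frontier = {y for x in frontier for y in nbrs.get(x, ())} - nodes
        nodes |= frontier
    return nodes, [(a, b) for a, b in g_edges if a in nodes and b in nodes]
```

Plain simulation would keep a B-labeled data node that has no A-labeled parent
for the pattern A → B; the parent condition above removes it.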

In [25], the authors propose strict simulation to further improve graph
simulation and adapt its computation to the vertex-centric Bulk Synchronous
Parallel (BSP) programming model [75] used by several graph processing
frameworks such as Pregel. They introduce an extra step in the algorithm of
strong simulation proposed in [45]. Strict simulation reduces the size, i.e.,
the number of nodes, of the balls inspected by strong simulation. The idea
is to first compute the match for dual simulation before inspecting the balls.
So, the balls are computed on the result of dual simulation and are
consequently much smaller than those computed by strong simulation. Formally,
strict simulation is defined as follows:

Definition 17 [25] A data graph $G = (V, E, f_A)$ matches the query pattern
$Q = (V_Q, E_Q, f_v, f_e)$ via strict simulation if there exists
a vertex $v \in V$ such that:

• $v \in V_d$, where $G_d(V_d, E_d, l_d)$ is the result match graph with
respect to $Q \prec^{D}_{sim} G$;

• $Q \prec^{D}_{sim} G_d[v, d_Q]$, where $G_d[v, d_Q]$ is a ball extracted
from $G_d$; and

• v is a member of the maximum match graph.

In the example of Figure 14, strict simulation will first compute the match
graph for dual simulation, i.e., subgraph $G_3$, and then begin inspecting the
balls.

[25] also proposes distributed algorithms to compute simulation, bounded
simulation, strong simulation and strict simulation.

Similarly to strict simulation, [21] introduces tight simulation, which
improves on strict simulation and comes closer to subgraph isomorphism. Tight
simulation focuses on reducing the number of balls inspected by strict
simulation. To do so, the authors propose to select a single vertex u of the
pattern Q and to use it as a candidate match for the center of a potential ball
in the data graph. u is chosen to be a vertex of minimum eccentricity, i.e., a
center of Q, which has the highest ratio of degree to label frequency (in Q).
This reduces both the radius of the balls and their number. So, tight
simulation is defined as follows:

Definition 18 A data graph $G = (V, E, f_A)$ matches the query pattern
$Q = (V_Q, E_Q, f_v, f_e)$ via tight simulation if there are
vertices $u \in Q$ and $u' \in G$ such that

• u is a center of Q with highest defined selectivity;

• $(u,u') \in R_d$, where $R_d$ is the dual relation set between Q and G;

• $Q \prec^{D}_{sim} G_d[u', r_Q]$, where $G_d[u', r_Q]$ is a ball extracted
from $G_d(V_d, E_d, l_d)$, which is the result match graph with respect to
$Q \prec^{D}_{sim} G$, and $r_Q$ is the radius of Q; and

• u' is a member of the resulting maximum match graph.

In the example of Figure 14, the node having label b is a center of the query
graph and will be used to extract the balls in the result of dual simulation,
i.e., the match graph $G_3$. The authors show that tight simulation gives
better results than strong simulation and strict simulation.

3.3 Summary-based approach 

Graph summarization/compression offers interesting perspectives for large graph
storage and processing. A graph summarization method that retains an
"acceptable amount" of the graph properties may be used as a preprocessing step
for several graph algorithms. The idea here is not to reduce the size of a huge
graph just to minimize its storage requirement and then decompress the graph to
process it. Rather, the aim is to obtain a compressed representation of the
graph that can be used, instead of the original graph, by the processing
algorithms, i.e., analysis, mining, comparison, querying, etc. In this vein,
[14] proposes an algorithm that finds all frequent subgraphs in a database of
large graphs where the database graphs are summarized. Summarization is
achieved by grouping nodes that have the same label into supernodes as follows:

Definition 19 (Summarized Graph). Given a labeled graph G such that its
vertices V(G) are partitioned into groups, i.e.,
$V(G) = V_1(G) \cup V_2(G) \cup \cdots \cup V_k(G)$, such that:
(1) $V_i(G) \cap V_j(G) = \emptyset$, $1 \le i \ne j \le k$, and
(2) all vertices in $V_i(G)$, $1 \le i \le k$, have the same label.

We can summarize G into a compressed version comp(G) where:
(1) comp(G) has exactly k nodes $V_1, V_2, \cdots, V_k$ that correspond to each
of the groups of V(G) (i.e., $V_i(G) \mapsto V_i$). The label of $V_i$ is set
to be the same as that of the vertices in $V_i(G)$, and

(2) an edge $(V_i, V_j)$ with label l exists in comp(G) if and only if there is
an edge (u,u') with label l between some vertex $u \in V_i(G)$ and some other
vertex $u' \in V_j(G)$.
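Definition 19 translates almost directly into code. The following Python
sketch builds comp(G) from a given partition of the vertices; the function name
`summarize`, the integer group identifiers and the choice to treat edges as
undirected are assumptions made for the example.

```python
def summarize(node_label, edges, group_of):
    """Build the summarized graph comp(G) for a label-homogeneous partition.

    node_label: dict vertex -> label
    edges: iterable of (u, v, edge_label), treated as undirected
    group_of: dict vertex -> group index (the partition of V(G))
    Returns (labels of the supernodes, set of labeled superedges).
    """
    super_label = {}
    for v, g in group_of.items():
        # Condition (2) of the partition: one label per group.
        if super_label.setdefault(g, node_label[v]) != node_label[v]:
            raise ValueError("vertices of a group must share the same label")
    super_edges = set()
    for u, v, edge_label in edges:
        gi, gj = sorted((group_of[u], group_of[v]))
        # Parallel edges between two groups collapse into one superedge.
        super_edges.add((gi, gj, edge_label))
    return super_label, super_edges


node_label = {1: "a", 2: "a", 3: "b", 4: "b"}
edges = [(1, 3, "x"), (2, 4, "x"), (3, 4, "y")]
group_of = {1: 0, 2: 0, 3: 1, 4: 1}
comp_labels, comp_edges = summarize(node_label, edges, group_of)
```

In this sketch an edge between two vertices of the same group becomes a
self-loop on the supernode, which is one possible reading of condition (2).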

The obtained summarized graphs may then be mined for frequent patterns using
any existing algorithm. To ensure that all patterns are found, the authors do
not rely on a single summarization of the graphs of the database; rather, they
proceed with several iterations, each of which consists of three steps:

• Step 1: For each $G_i$ in a graph database D, randomly partition its vertex
set $V(G_i)$.

• Step 2: Execute a pattern mining algorithm on the resulting summarized
database.

• Step 3: Compute the support of each resulting pattern in the original
database, i.e., the number of graphs that contain the pattern. Discard the
pattern if its support is lower than a predefined threshold.

The number of iterations is controlled by the probability of missing a
frequent pattern.

In [24], the authors observe that users typically adopt a class Q of queries
when querying a data graph G. They propose a graph compression that preserves
the queries of Q. This means that each query in Q returns the same result
when applied to G and when applied to the compression of G. They define
compression functions for two kinds of graph queries: reachability queries and
pattern queries. Roughly speaking, for reachability queries, which aim to
determine whether a node is reachable from another, the compression function
groups the nodes that have the same ancestors and the same descendants. For
pattern queries, the compression function is equivalent to the one given by
Definition 19.
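The grouping rule for reachability queries can be illustrated with a naive
Python sketch that computes, for every node, its sets of ancestors and
descendants and groups nodes with identical pairs. This is only a direct
reading of the rule stated above, with one traversal per node, and not the
compression algorithm of [24]; the function names are hypothetical.

```python
def reachable(adj, source):
    """Set of nodes reachable from source by a non-empty directed path."""
    seen, stack = set(), list(adj.get(source, ()))
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj.get(x, ()))
    return frozenset(seen)


def reachability_classes(nodes, edges):
    """Group nodes that have the same ancestors and the same descendants."""
    fwd, bwd = {}, {}
    for a, b in edges:
        fwd.setdefault(a, []).append(b)
        bwd.setdefault(b, []).append(a)
    groups = {}
    for v in nodes:
        key = (reachable(bwd, v), reachable(fwd, v))  # (ancestors, descendants)
        groups.setdefault(key, []).append(v)
    return sorted(groups.values())


# Hypothetical diamond-shaped graph: 1 -> 2 -> 4 and 1 -> 3 -> 4.
classes = reachability_classes([1, 2, 3, 4], [(1, 2), (1, 3), (2, 4), (3, 4)])
```

In the diamond example, nodes 2 and 3 fall into the same class, so a
reachability query that involves either of them can be answered on the merged
node.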

In [43], the authors propose a new solution for the comparison of large
graphs. Their approach relies on a compact encoding of graphs called prime
graphs. Prime graphs are smaller and simpler than the original ones but they
retain the structure and properties of the encoded graphs. An example of a
graph and its prime graph is given in Figure 15. In [43], the authors propose
to approximate the similarity between two graphs by comparing the corresponding
prime graphs. Their proposed approach involves the following steps:

• Building the prime graphs of the compared graphs. Prime graphs are
obtained by modular decomposition of the original graphs. Modular
decomposition is one of the best-known graph decompositions [33]. It was
introduced by Gallai [25] to solve optimization problems. Modular
decomposition generates a representation of a graph that highlights groups
of vertices that have the same neighbors outside the group. These subsets
of vertices are called modules. The prime graph corresponds to the graph
obtained by compressing all the modules recursively.

• Partitioning the compared prime graphs into stars of modules as in [81].

• Computing the distance between two prime graphs based on the distance
between each pair of stars of modules. Given a query prime graph $PG_1$ and
a target prime graph $PG_2$, the nodes of $PG_1$ are mapped to the nodes of
$PG_2$ using the Hungarian algorithm by defining a cost matrix that records,
for each star of modules of $PG_1$, the edit operations that are needed to
transform it into each star of modules of $PG_2$.

• Solving the assignment problem by using the Hungarian algorithm [45] to
obtain the minimum distance.
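The notion of module used in the first step can be checked with a few lines of
Python. The sketch below only tests whether a given vertex subset is a module,
i.e., whether all its vertices have the same neighbors outside the subset; it
does not compute the modular decomposition itself. The function name and the
small example graph are assumptions.

```python
def is_module(adj, subset):
    """Return True if all vertices of subset share the same outside neighbours.

    adj: dict vertex -> set of neighbours (undirected graph)
    subset: iterable of vertices
    """
    subset = set(subset)
    outside = [frozenset(adj[v]) - subset for v in subset]
    return all(nbrs == outside[0] for nbrs in outside)


# Hypothetical graph: vertices 1 and 2 are both adjacent to 3 and 4 only.
adj = {1: {3, 4}, 2: {3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
```

Here {1, 2} is a module and can be compressed into a single vertex, whereas
{1, 3} is not, because vertex 3 is adjacent to vertex 2 while vertex 1 is not.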



Figure 15: Example of a graph and its prime graph. (a) A protein graph of 1818
nodes and 1833 edges from the RI database [3]. (b) The corresponding prime
graph having 271 nodes and 321 edges.


4 Discussion 

Tables 1, 2 and 3 summarize all the presented approaches within the three
categories: partition-based approaches, search-space exploring approaches and
summary-based approaches, respectively.

The tables summarize these approaches according to the following facets:

• Graphs: the type of graphs on which the graph comparisons are performed:
directed/undirected graph, labeled/unlabeled edges.

• Decomposition unit: the type of graph partitioning given by the name of
the subgraph structure.

• Comparison concept: the type of similarity used for graph comparison.

• Application: describes the application area of the approach.

• Program: the type of program, which can be sequential or parallel.

• Size of the query: describes the range of the size of the graph query used
for matching.

• Size of data graph: describes the range of the size of the data graph used
for matching. The size here is given in terms of the number of nodes and
edges in the graph. It can be thousand (k), million (M) or billion.

• Time complexity: the time complexity of the approach, when it has been
computed.


Throughout this survey, we can see that various solutions are considered and
that there is no generic algorithm for graph comparison or graph pattern
matching that handles every type of graph (labeled/unlabeled and
directed/undirected). Partition-based approaches are increasingly used for
graph comparison and pattern matching. In fact, these approaches
have a good time complexity and are easy to adapt to parallel algorithms.
The problem of matching in partition-based approaches is simplified by
decomposing the graphs to be matched into smaller subgraphs. However, the best
graph decomposition technique to adopt for computing distance
remains an open problem for large graphs, even if we note that the majority of
partitioning approaches rely on a star decomposition. Besides, the approaches
that use the Hungarian algorithm [42] on a large cost matrix, such as [81],
suffer from memory problems. Heuristics or other methods that compute the
minimum cost while avoiding the construction of the cost matrix would be
valuable. Also, a parallel version of the Hungarian algorithm that relies on a
partitioning of the matrix storage and computation would scale these approaches
to larger graphs.

Furthermore, using partition-based approaches in subgraph search is generally
associated with joins or indexing methods. Both are time-consuming
and complex tasks, especially for large graphs. So, research must focus on
methods that avoid them or adapt them to large graphs.

We can also note that several graph matching techniques have not been
investigated on large-scale graphs. Among these solutions, we can cite
clustering-based methods and polynomial heuristics for the greatest common
subgraph. Invariant-based graph comparison [48][80] may also give good results.

To cope with large graphs, one solution is to compress the graph without loss
of information and to perform the matching on the compressed graph. However,
there are not enough summary-based approaches. Reducing and compressing a graph
for graph matching is a very interesting approach. There are two benefits:
saving storage space on disk and performing the matching on a compressed and
reduced graph without decompression [43]. In addition, graph compression
techniques that retain all the information of the original graphs and that can
be used for matching remain a challenge.

In the majority of approaches, the space complexity of graph matching has
not been investigated. The different approaches say little about space
complexity and the memory consumption of algorithms, which are important
performance metrics, both in theory and in practice, when coping with large
graphs.

Table 1: Summary of partition-based approaches

| Approach | Graphs | Decomposition unit | Comparison concept | Application | Size of the query | Size of data graph | Program | Time complexity |
|---|---|---|---|---|---|---|---|---|
| [21][22] | directed, labeled edges | BARG | Edit distance | Image processing | No experiment | No experiment | Sequential | 0(\Vg\ j \Vq\'(\Vg\ + \Vq\)) |
| EH | directed, unlabeled edges | Relational Description | Number of common RDs | Image processing | No experiment | No experiment | Sequential | Not computed |
| OH | undirected, labeled edges | Star | Edit distance | Probing Image processing | 4-12 nodes, 3-11 edges | 4-12 nodes, 3-11 edges | Sequential | O(VJ) |
| [81] | undirected, unlabeled edges | Star | Edit distance | Chemistry, Networks | 5-65 nodes, ~30 edges | 1-80 nodes | Sequential | o<y«§) |
| [62][63] | undirected, labeled edges | Star | Edit distance | Image processing | - | 8-126 nodes, 9-328 edges | Sequential | Not computed |
| m | undirected, labeled edges | Signature | Edit distance | Retrieving Image | - | 9-417 nodes, 9-112 edges | Sequential | Not computed |
| [82] | undirected, unlabeled edges | Half-edge subgraph | Edit distance | Chemistry, Networks | - | 40 - 100k nodes | Sequential | |
| [83] | undirected, unlabeled edges | path-based q-gram | Edit distance | Chemistry, Networks | - | 40 - 126 nodes | Sequential | $O(\tau(\lvert V_G\rvert + \lvert V_Q\rvert)\log\lvert V_Q\rvert)$ |
| EH | undirected, unlabeled edges | κ-AT | Edit distance | Chemistry, Networks | - | 40 - 100k nodes | Sequential | Not computed |
| [70] | undirected, unlabeled edges | STwig | Subgraph Matching | Web, Networks | 3-10 nodes, 10-20 edges | 80 - 4096K nodes | Parallel | o(kH |
| [39] | undirected, unlabeled edges | Neighborhood vector | Edit distance | Web, Networks | 8-12 nodes | 172k-100000k nodes, 579k-213000k edges | Sequential | $O(\lvert V_G\rvert \cdot d^{h})$ |
| [40] | undirected, unlabeled edges | Neighborhood vector | Edit distance | Web, Networks | 3-7 nodes | 2M - 12M nodes, 11M - 20M edges | Sequential | $O(\lvert V_Q\rvert \cdot \lvert V\rvert + I \cdot \lvert V_Q\rvert \cdot m_Q \cdot d_Q)$ |
| m | undirected, labeled edges | Branch structure | Edit distance | Biology | 40k - 100k nodes | 40k - 100k nodes | Sequential | Not computed |

τ: graph edit distance threshold.
h: hops, d: the average degree of each node.
$d_Q$: the maximum number of h-hop neighbors of each query node.
$m_Q$: maximum number of candidates per query node.

The problem of matching dynamic graphs has not received enough attention
in the literature. Currently, with social networks and the web, graphs change
continuously: nodes and edges are added to or deleted from the graph over
time. The problem is then to take into account the evolution of dynamic
graphs in graph comparison or pattern matching approaches. Apart from the
work of [33], we found little literature on this question.


5 Conclusion 

The dominance of graphs as a representation tool in real-world applications
demands new graph matching techniques, concepts, and languages to match
large graph datasets efficiently. We have presented a review of recent work on
graph comparison and graph pattern matching approaches for large graphs,
highlighting the different notions, techniques and concepts used for matching
and their impact when coping with large graphs. We classified the approaches
into three categories: partition-based approaches, search-space exploring
approaches and summary-based approaches. Each of them has its advantages and
application areas. Many recent graph comparison and graph pattern matching
approaches converge towards partitioning the compared graphs: the problem is
simplified by decomposing the graphs to be matched into smaller subgraphs.
However, these approaches are not always applicable, and few algorithms are
suitable for all kinds of graphs and applications. Overall, as discussed in the
previous section, several problems and areas of investigation deserve future
research despite the substantial results of current and past work. According to
the International Technology Roadmap for Semiconductors (ITRS), as many as
6000 processors are expected on a single system-on-chip by the end of 2026.
Moreover, memory size will follow the same trend. Thus, parallel graph
matching algorithms are needed to run matching processes quickly and to
exploit hardware resources, such as the number of processors and the memory
size, efficiently. Moreover, due to the huge size of graphs, matching on
compressed graphs without decompression remains a challenging issue.
Combining parallelism with compression or partitioning is also a promising
direction. Furthermore, dynamic graphs and graphs in streaming
applications are not sufficiently addressed by current research.

Table 2: Summary of search-space exploring approaches

Approach | Graphs | Comparison concept | Application | Size of the query | Size of data graph | Program | Time complexity
[?] | undirected, unlabeled | Subgraph isomorphism | Biology | 2-15 nodes, 1-10 edges | 0.5M-4M nodes, 32M edges | Sequential | 0(|V«4|)
[23] | directed, labeled edges | Bounded simulation | Web | 4-10 nodes | 1k-20k nodes, 19k-58k edges | Sequential | $O(\lvert V_G \rvert \lvert E_G \rvert + \lvert E_Q \rvert \lvert V_G \rvert^2 + \lvert V_Q \rvert \lvert V_G \rvert)$
[?] | directed, unlabeled | Strong simulation | Web | 3-15 nodes | millions of nodes, billions of edges | Sequential | Not computed
[25] | directed, unlabeled edges | Strict simulation | Social networks | 10-20 nodes | millions of nodes, billions of edges | Parallel | Not computed
[?] | directed, unlabeled | Tight simulation | Social networks | 5-100 nodes | millions of nodes, billions of edges | Parallel | oW!\)


Table 3: Existing summary-based approaches

Approach | Graphs | Comparison concept | Application | Size of the query | Size of data graph | Program | Time complexity
[14] | undirected, labeled edges | Subgraph mining | Program data | Not necessary | 100-20k nodes, 220k edges | Sequential | Not computed
[?] | directed, labeled edges | Compression preserving query | Social networks | 3-8 nodes, 3-8 edges | 6k-2.4M nodes, 21k-5M edges | Sequential | $O(\lvert V(G) \rvert^2 + \lvert V(G) \rvert \lvert E(G) \rvert)$
[?] | undirected, unlabeled edges | Prime graph | Biological graphs | 8-34000 nodes | 9-33k nodes, 9-332k edges | Sequential | O(F + |V(G)| + |E(G)|)


k is the number of vertices in the largest prime graph. 







